Updated: Mon Mar 4 15:54:55 2019

1 IP Addresses for Cybersecurity

1.1 Retrieving Data

Read the dataset in with proper column names:

##    user  system elapsed 
##   0.489   0.055   0.617

Applying pre-processing on our dataframe:

library(tidyr)

ipaddr <- separate(data=ipaddr, 
         col="Coords", 
         into=c("Lat", "Lng"), 
         sep = ",", 
         remove=TRUE)

ipaddr[,c("Lat", "Lng")] <- lapply(ipaddr[,c("Lat", "Lng")], as.numeric)
## Classes 'tbl_df', 'tbl' and 'data.frame':    258626 obs. of  9 variables:
##  $ IP         : chr  "222.76.212.189" "222.76.212.185" "222.76.212.186" "5.34.246.67" ...
##  $ Reliability: int  4 4 4 6 4 4 4 4 4 6 ...
##  $ Risk       : int  2 2 2 3 5 2 2 2 2 3 ...
##  $ Type       : chr  "Scanning Host" "Scanning Host" "Scanning Host" "Spamming" ...
##  $ Country    : Factor w/ 152 levels "CN","US","UA",..: 1 1 1 2 3 2 1 1 1 NA ...
##  $ Locale     : Factor w/ 2572 levels "Xiamen","Merefa",..: 1 1 1 NA 2 3 1 1 1 NA ...
##  $ Lat        : num  24.5 24.5 24.5 38 49.8 ...
##  $ Lng        : num  118.1 118.1 118.1 -97 36.1 ...
##  $ x          : chr  "11" "11" "11" "12" ...

Simple visualization of the dataframe:

library(ggplot2)
library(hrbrthemes)

# get top 20 countries
top_countries <- names(table(ipaddr$Country)[1:20])

ggplot(data=subset(ipaddr, Country %in% top_countries),
       aes(x=reorder(Country, Country, length))) + 
  geom_bar() +
  coord_flip() +
  labs(title="Country by appearance frequency", 
       subtitle="Number of countries in the IP reputation database",
       x="Country", y="Count") +
  theme_ft_rc()

ggplot(data=subset(ipaddr, Country %in% top_countries),
       aes(x=reorder(Country, Lng), y=Lng)) + 
  geom_boxplot() +
  coord_flip()+
  labs(title="Median of Lng by Country")+
  theme_ipsum_rc()

An IPv4 address consists of 32 bits. This becomes clear when you convert the decimal notation into the binary equivalent: 201.105.7.34 corresponds to 11001001 01101001 00000111 00100010. In our dataset, our each of the IP address is also given a score on its Risk and Reliability:

##      Reliability      1      2      3      4      5      6      7      8      9     10
## Risk                                                                                  
## 1                     0      0     16      7      0      8      8      0      0      0
## 2                   804 149114   3670  57653      4   2084     85     11    345     82
## 3                  2225      3   6668  22168      2   2151    156      7    260     79
## 4                  2129      0    481   6447      0    404     43      2     58     24
## 5                   432      0     55    700      1    103      5      1     20     11
## 6                    19      0      2     60      0      8      0      0      1      0
## 7                     3      0      0      5      0      0      0      0      2      0
##    Risk Reliability Frequency
## 9     1           3        16
## 15    1           4         7
## 25    1           6         8
## 31    1           7         8
## 1     2           1       804
## 7     2           2    149114

## # A tibble: 6 x 10
##   IP    Reliability  Risk Type  Country Locale   Lat    Lng x    
##   <chr>       <int> <int> <chr> <fct>   <fct>  <dbl>  <dbl> <chr>
## 1 222.…           4     2 Scan… CN      Xiamen  24.5  118.  11   
## 2 222.…           4     2 Scan… CN      Xiamen  24.5  118.  11   
## 3 222.…           4     2 Scan… CN      Xiamen  24.5  118.  11   
## 4 5.34…           6     3 Spam… US      <NA>    38    -97   12   
## 5 178.…           4     5 Scan… UA      Merefa  49.8   36.1 11   
## 6 66.2…           4     2 Scan… US      Union…  37.6 -122.  11   
## # … with 1 more variable: simple_type <fct>
##   Risk Reliability Type Freq
## 1    1           1  C&C    0
## 2    2           1  C&C    0
## 3    3           1  C&C    0
## 4    4           1  C&C    1
## 5    5           1  C&C    2
## 6    6           1  C&C    1

1.2 IP Addresses and Bits

1.2.1 About Bits

  • A “bit” (binary digit) is atomic: it stores just a 0 or 1 and is the smallest unit of storage
  • Anything with two separate states can store 1 bit
  • A bit is too small to be much use, we typically group 8 bits to form 1 byte
  • A byte (collection of 8 bits) look like this: 0110 1001
  • One byte can store one character, e.g. “A” or “?” or “x”
  • One byte can make 256 different patterns (and thus hold a number between 0 and 255), making it suitable for storing characters / letters
    • A is 65
    • B is 66
    • space is 32
    • ~ is 126
  • 4 bytes can store numbers between -2147483648 and 2147483647
    • 2^(4*8)=4,294,967,296
  • 8 bytes can store numbers between -9223372036854775808 and 9223372036854775807

  • IP addresses stored using the dotted-decimal range format goes from 0.0.0.0 to 255.255.255.255, which is 32 bits. In the possible address space, this means there is a total of 4,294,967,296 possible addresses (the maximum value of a 32-bit integer). This has a special significance: Any IP addresses can be converted to/from a 32-bit integer value. This is important because the integer representation saves both space and time and we can calculate some things easier with this representation than with a dotted-decimal form.
  • If we perceive (or use a tool that perceives) IP addresses as a character, we are potentially wasting space by trading a 32-bit to 120-bit. We also trade away efficiency by using string comparison code versus integer arithmetic to accomplish some task.

1.2.2 About IP Addresses

  • IP networks may be divided into subnetworks. For this reason, an IP address is recognized as consisting of two parts
    • The network prefix in the high-order bits
    • The remaining bits called the rest field used for host numbering within a network
  • This original address definition means the first eight bits (octet) of the 32-bit address is the most important as it, being the network number field, specified the particular network a host was attached to. The remaining 24 bits uniquely identified a host connected to that network. This format was sufficient at a time when only a few large networks existed (since only the first octet specifies the network number, only 254 indepedent networks are supported at that time). This necessitates the introduction of address classes, and hence a new addressing architecture was introduced in 1981 as part of the specification of the Internet Protocol. This new architecture divides the address space into three classes (class A, B, C).

  • The number of addresses usable is always \(2^N-2\) where N is the number of bits. The subtraction of 2 adjusts for the use of the all-bits-zero host portion for network address and the all-bits-one host portion as a broadcast address

  • The later versions of the IPv4 addressing system broadly separates IP addresses into 5 classes, which can be identified by the first octet of IP address. There are 4 octets so here by “first octet” we’re referring to the 127 in “127.100.100.0” (similarly, 0 is the fourth octet).

Classful addressing definition:

Class Theoretical Address Range Binary Start Used for
A 0.0.0.0 to 127.255.255.255 0 Very large networks
B 128.0.0.0 to 191.255.255.255 10 Medium networks
C 192.0.0.0 to 223.255.255.255 110 Small networks
D 224.0.0.0 to 239.255.255.255 1110 Multicast
E 240.0.0.0 to 247.255.255.255 1111 Experimental
  • This architecture extended the addressing capability in the Internet but did not prevent IP address exhaustian. Many sites required larger address blocks than a Class C network provided (\(2^{8}-2 = 254\)) and thus receive a Class B block, rapidly depleting the availability of Class B addresses. Classful networking was replaced by Classless Inter-Domain Routing (CIDR) in 1993 to solve this problem.

  • n: network ID
  • H: host ID
  • Class D and E not shown for brevity:

Classes (A to C shown) Bit-wise representation Notation
Class A start: 0.0.0.0 00000000.00000.00000.00000.0000 0nnnnnnn HHHHHHHH HHHHHHHH HHHHHHHH
Class A end: 127.255.255.255 01111111.00000.00000.00000.0000 0nnnnnnn HHHHHHHH HHHHHHHH HHHHHHHH
Class B start: 128.0.0.0 10000000.00000000.00000000.00000000 10nnnnnn nnnnnnnn HHHHHHHH HHHHHHHH
Class B end: 191.255.255.255 10111111.11111111.11111111.11111111 10nnnnnn nnnnnnnn HHHHHHHH HHHHHHHH
Class C start: 192.0.0.0 11000000.00000000.00000000.00000000 110nnnnn nnnnnnnn nnnnnnnn HHHHHHHH
Class C end: 223.255.255.255 11011111.11111111.11111111.11111111 110nnnnn nnnnnnnn nnnnnnnn HHHHHHHH

1.2.3 Feature Engineering: IP Addresses to Integer and vice versa

We’re going to be making use of the unlist and strsplit functions to perform numerical computations on our IP addresses. Here’s a primer if you’re unfamiliar with the two functions:

## [1] "192" "168" "100" "0"
## [1] 192 168 100   0

We will also be using a left reduce function, which applies a function to the elements of a vector:

## [1] 460
## [1] 460

Combining these two, we can now write the IP to Integers function. Don’t worry about the functions from bitops for now:

library(bitops)
# Define function for converting IP addresses to/from integers 
# take an IP address string in dotted octets (e.g. 192.168.0.1)
# and convert it to a 32-bit long integer (3232235521)

ip2long <- function(ip){
  # convert string into vector of characters
  # 255.0.0.17 -> c(255, 0, 0, 17)
  ips <- unlist(strsplit(ip, ".", fixed=TRUE))
  # bit-shift, then "OR" the octets
  octet <- function(x, y){
    bitOr(bitShiftL(x,8), y)
  }
  Reduce(octet, as.integer(ips))
}

# take a 32-bit IP address (3232235521)
# and convert it to a dot notation
long2ip <- function(longip){
  # set up reversing bit manipulation
  octet <- function(nbits){
    bitAnd(bitShiftR(longip, nbits), 0xFF)
  }
  paste(Map(octet, c(24, 16, 8, 0)), sep="", collapse=".")
}
## [1] "192.168.0.1"
## [1] 3232235521

1.2.3.1 Feature Engineering: Extract Classes from IP Addresses

In the context of classful IP Addresses, one may think of writing an extraction function to determine the class membership of an IP address based on what we’ve learned. However, to the extent that CIDR has rendered classful IP addresses obsolete. Protocols can no longer assume they can deduce the prfix length or subnet mask from the first part of the address; Instead, all prefix lengths are explicitly communicated: classless.

1.3 Classless Inter-Domain Routing (CIDR)

Recall that in the classful structure,

  • Class A delegations contained 16777216 addresses each
  • Class B delegations contained 65536 addresses each
  • Class C delegations contained 256 addresses each

This was very inefficient for networks that didn’t fit these sizes. A network that needed 4096 addresses would either get sixteen Class C delegations (which would be bad for the global routing table because each of them would have to be routed separately: the class size was built into the protocol) or they would get one Class B delegation (which would waste a lot of addresses).

In 1993 CIDR was introduced. The protocols were adjusted to be able to deal with prefixes of different sizes and it became possible to route (both internally and externally) prefixes like a /30 or a /21 or a /15 etc etc. Anything between /0 and /32 became possible. Organisations that needed 16384 addresses could get a /18: exactly what they would need instead of the Class B delegation that consume 65,534 addresses.

CIDR encompasses the concept of variable-length subnet masking (VLSM) technique, which allows for the specification of arbitrary-length prefixes. CIDR introduced a new representation of IP addresses by prefixing an address with a suffix indicating the number of bits of the prefix, such as 192.0.2.0/24 for IPv4 and 2001:db8::/32 for IPv6. CIDR also introduces an administrative process of allocating address blocks to organizations based on their actual and short-term projected needs.

CIDR is based on the idea of subnet masks. A mask is placed over an IP address and creates a sub network: a network that is subordinate to the internet. The subnet mask signals to the router which part of the IP address is assigned to the hosts (the individual participants of the network) and which determines the network.

Instead of adding a subnet mask, a specification in the form of suffixes can also be integrated directly into the IP address using classless inter-domain routing. An integral fact pertaining CIDR is the concept of VLSM: the variable length subnet mask allows subnets to be realized with variable lengths and not only in size order of the network classes.

1.3.1 CIDR Notation

CIDR notation is a compact representation of an IP address and its associated routing prefix. The notation is constructed from an IP address, a slash (‘/’) character, and a decimal number. The number is the count of leading 1 bits in the subnet mask. Larger values here indicate smaller networks. The maximum size of the network is given by the number of addresses that are possible with the remaining, least-significant bits below the prefix.

  • 192.168.100.14/24 represents the IPv4 address 192.168.100.14 and its associated routing prefix 192.168.100.0, or equivalently, its subnet mask 255.255.255.0, which has 24 leading 1-bits
  • the IPv4 block 192.168.100.0/22 represents the 1024 IPv4 addresses from 192.168.100.0 to 192.168.103.255

1.3.2 Primer: Classful to classless domain routing infrastructure

An IP address has two components: - Network address
- Host address

A subnest mask separates the IP address into the network and host address (). Subnetting further divides the host part of an IP address into a subnet and host address () if additional subnetworks are needed.

A subnet mask is referred to as such because it identifies network addresses of an IP addresses by performing a bitwise AND operation on the netmask.

##  [1] 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [1] 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

A subnet mask is a 32-bit number that masks an IP address: it is made by setting network bits to all 1s and setting host bits to all 0s. Examples of commonly used netmasks for classed networks are 8-bits (Class A), 16-bits (Class B) and 24-bits (Class C), and classless networks are as follows:

Class Address # of Hosts Netmask (Binary) Netmask (Decimal)
CIDR /4 240,435,456 11110000 00000000 00000000 00000000 240.0.0.0
CIDR /5 134,217,728 11111000 00000000 00000000 00000000 248.0.0.0
CIDR /6 67,108,864 11111100 00000000 00000000 00000000 252.0.0.0
A /8 16,777,216 11111111 00000000 00000000 00000000 255.0.0.0
CIDR /9 8,388,608 11111111 10000000 00000000 00000000 255.128.0.0
B /16 65,534 11111111 11111111 00000000 00000000 255.255.0.0
CIDR /17 32,768 11111111 11111111 10000000 00000000 255.255.128.0
CIDR /18 16,384 11111111 11111111 11000000 00000000 255.255.192.0
CIDR /22 1,024 11111111 11111111 11111100 00000000 255.255.252.0
C /24 256 11111111 11111111 11111111 00000000 255.255.255.0
CIDR /30 4 11111111 11111111 11111111 11111100 255.255.255.252

To see the conversion between binary to integers, we can use the strtoi function in R. For example, in the case of /4 and /5, the respective binary addresses begin with:

## [1] 240
## [1] 248

From the examples above, we observe the CIDR notation written as the address of a network followed by a slash and ending with the bit-length of the prefix. 198.51.100.0/24 thus allocated 24 bits for the network prefix with the remaining 8 bits reserved for host addressing. In other words: addresses in the range of 198.51.100.0 to 198.51.100.255 belong to this subnet.

Subnet masks are also expressed in dot-decimal notation like an address. For example, 255.255.255.0 is the subnet mask for the prefix 198.51.100.0/24. Traffic is exchanged between subnetworks through routers when the routing prefixes of the source address and the destination address differ. A router serves as a logical or physical boundary between the subnets.

1.3.3 Why subnetting?

In the address allocation architecture of the Internet using CIDR and in large organizations, it is necessary to allocate address space efficiently. Subnetting may also enhance routing efficiency, or have advantages in network management when subnetworks are administratively controlled by different entities in a larger organization. Subnets may be arranged logically in a hierarchical architecture, partitioning an organization’s network address space into a tree-like routing structure.

Another advantage lies in security since all hosts in a subnetwork see all packets transmitted by any node in a network. In addition, performance of a network is adversely affected under heavy traffic load due to collisions and retransmissions.

We’ve learned that applying a subnet mask to an IP address separates network address from host address. The network bits are represented by the 1’s in the mask, and the host bits are represented by 0’s. When we apply the Class C subnet mask to the IP address 216.3.128.12, we get through the bit-wise AND operator:

IP: 1101 1000.0000 0011.1000 0000.0000 1100 (216.003.128.012) Mask: 1111 1111.1111 1111.1111 1111.0000 0000 (255.255.255.000)

Produces: 1101 1000.0000 0011.1000 0000.0000 0000 (216.003.128.000)

## [[1]]
## [1] 1 1 0 1 1 0 0 0
## 
## [[2]]
## [1] 0 0 0 0 0 0 1 1
## 
## [[3]]
## [1] 1 0 0 0 0 0 0 0
## 
## [[4]]
## [1] 0 0 0 0 1 1 0 0

Another scenario where subnetting is used is one where a web host with a Class C network needing to divide the network such that parts of its network can be leased to its customers. Assuming that the network address given is 216.3.128.0 and the company wants to divide the network into 2: first half for internal usage and second half to its tenants / customers:
- 216.3.128.(0000 0000): assigned to web host
- 216.3.128.(1000 0000): assigned to customers

The web host will have the subnet mask of 216.3.128.128 (/25). Now, we’ll further divide the 2nd half into eight block of 16 IP addresses: - 216.3.128.128 (strtoi("1000000", base=2)) Customer 1 gets 16 IPs (14 usable)
- 216.3.128.144 (strtoi("1001000", base=2)) Customer 2 gets 16 IPs (14 usable)
- 216.3.128.160 (strtoi("1010000", base=2)) Customer 3 gets 16 IPs (14 usable)
- …

Subnet mask of 255.255.255.(1111 0000) / subnet mask of 255.255.255.240

## [1] 128
## [1] 144
## [1] 160

The classful address scheme (Class A, B and C) of allocating IP addresses in 8-bit increments can be very wasteful. With classful IP addressing, a minuimum number of IP addresses allocated to any organization is 256 (Class C). When a company only requires 15 but receive 256 addresses, waste is incurred. An organization that requires more than 256 (for example, 500) is assigned a Class B, which allocates 65,536 and an organization that is assigned Class A is allocated 16.8 million IP addresses.

With CIDR, a network of IP addresses is allocated in 1-bit increments as opposed to 8-bits in classful network. The use of a CIDR notated address can easily represent classful addresses (Class A = /8, Class B = /16, and Class C = /24). The number next to the slash (i.e. /8) represents the number of bits assigned to the network address.

When we’re performing CIDR-based analyses, a common task is to determine whether an address falls within a given CIDR range. In R code, we are converting both the IP address and network block address to integer then performing the necessary bitwise operations to see if they do match up.

ip_in_cidr <- function(ip, cidr){
 long_ip <- ip2long(ip)
 cidr_parts <- unlist(strsplit(cidr, "/"))
 cidr_range <- ip2long(cidr_parts[1])
 cidr_mask <- bitShiftL(bitFlip(0), 
                        (32-as.integer(cidr_parts[2]))
                        )
 return(bitAnd(long_ip, cidr_mask) == bitAnd(cidr_range, cidr_mask))
}

ip_in_cidr("10.0.1.15", "10.0.1.3/24")
## [1] TRUE
ip_in_cidr("10.0.1.15", "10.0.2.255/24")
## [1] FALSE

1.4 Autonomous System Number (ASN) and BGP (border gateway protocol)

With CIDR, blocks of addresses can be treated as groups of nodes: we can use them to identify network packets (incoming or outgoing traffic) and therefore classify groups of malicious / legitimate nodes. A machine learning engineer can then use these groupings to dig into the data and relationships and extract potential features.

Furthermore, once we are familiar with the CIDR prefix format, we can see how those prefixes are grouped and defined as an autonomous system (AS). An AS is a fancy term but what it really mean is a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity that presents a common and clearly defined routing policy to the internet.

Originally the definition required control by a single entity, typically an ISP or a very large organization with independent connections to multiple networks, that adhered to a single routing policy. That ISP must have an officially registered autonomous system number (ASN). A unique ASN is allocated to each AS and uniquely identifies each network on the internet. ASN are assigned in blocks by IANA to regional internet registries (RIRs) and the RIR in turn assign ASNs to entities within its designated area from the block assigned by IANA.

These Autonomous Systems must have an officially registered autonomous system number (ASN), which they get from their Regional Internet Registry: AFRINIC, ARIN, APNIC, LACNIC or RIPE NCC

A newer definition came into use because multiple organizationscan run Border Gateway Protocol (BGP) using private AS numbers to an ISP that connects all those organizations to the internet. BGP is a protocol designed to exchange routing and reachability information among autonomous systems (AS) on the internet; it makes routing decisions based on paths, network policies or rule-sets configured by a network administrator and is involved in making core routing decisions.

A unique ASN (AS Number) is allocated to each AS for use in BGP routing. AS numbers are important because the ASN uniquely identifies each network on the Internet.

1.5 ggplot2 Visualization of malicious nodes

Taking a peek at the dataset:

##            IPAddr Reliability Risk          Type Country Locale     LAT
## 4     5.34.246.67           6    3      Spamming      US        38.0000
## 10  174.142.46.19           6    3      Spamming                24.4798
## 16 112.216.121.87           4    3 Scanning Host      KR        37.0000
## 17 112.216.121.78           4    3 Scanning Host      KR        37.0000
## 18 112.216.121.77           4    3 Scanning Host      KR        37.0000
## 19 112.216.121.75           4    3 Scanning Host      KR        37.0000
##         LNG
## 4  -97.0000
## 10 118.0819
## 16 127.5000
## 17 127.5000
## 18 127.5000
## 19 127.5000
## 
##            C&C Malicious Host     Malware IP  Scanning Host       Spamming 
##            463           3180           5782          31854           1291

1.5.1 Peers and Best Path

When two routers have established connection for exchanging BGP information, they are referred to as BGP peers. Such BGP peers exchange routing information between them via BGP sessions that run over TCP, which is a reliable, connection oriented & error free protocol.

Once the BGP Session is established, the routers can advertise a list of network routes that they have access to and will scrutinize them to find the route with the shortest path. BGP doesn’t make sense when a peer is connected only to one other peer because that peer is always going to be the best and only path to other networks. When a network is connected to multiple peers then certain paths will be shorter, faster or more reliable than others. Google’s AS15169 peers with 265 other IPv4 peers one of which is Digital Ocean Inc.’s AS14061. Both are connected to other ISPs on the internet, but since they are peers they can also exchange routing information, giving routers within their network a shorter path of connectivity. If the neighborship is broken, their routers can re-arrange their routing tables to connect with other Autonomous Systems through other peers / AS.

Because of the relationship between ASN and BGP, it’s possible to know the adjacent neighbors of each AS; if one AS “neighborhood” is rife with malicious nodes or suspicious activity it might a leading indicator that autonomous systems around it are also harboring malicious traffic.

For comparison, if we refer to the DNSlytics for as14061: Digital Ocean and compare it against as7393: Cybercon, we can get some quick summary of the IPs that are hosted on each AS. Notice that, for example, the number of spam domains hosted on as14061 is 1101-to-1 for example as compared to 8669-to-1.

##                                Variables      as14061   as7393
## 1                               Assignee DigitalOcean Cybercon
## 2               Number of domains hosted    1,667,967   52,014
## 3         Number of adult domains hosted        5,636      754
## 4          Number of name servers hosted       70,825    1,449
## 5            Number of SPAM hosts hosted        1,515        6
## 6          Number of open proxies hosted          195        0
## 7          Number of mail servers hosted      223,061   27,664
## 8           Number of IDN domains hosted        1,620        0
## 9 Number of domains in Alexa top million       17,939      500

2 Exploratory Data Analysis on IP Addresses

In the following chapter, we’ll draw upon lessons from traditional exploratory data analysis and visual exploration techniques to help us in our process of insight discovery.

In the following subsection, we’ll start by augmenting our data with additional meta-data from the IANA IPv4 Address Space Registry. The dataset we’ll read in represent a high level grouping of IPv4 address space registry allocations - although it should be said that most of the registrants are not responsible for the activities of individual nodes. We’ll use this extra information for what it’s intended: to set up further investigations.

2.1 Augmenting IP Address with Address Space data

##   Prefix                 Designation    Date           WHOIS
## 1  000/8 IANA - Local Identification 1981-09                
## 2  001/8                       APNIC 2010-01 whois.apnic.net
## 3  002/8                    RIPE NCC 2009-09  whois.ripe.net
## 4  003/8        Administered by ARIN 1994-05  whois.arin.net
## 5  004/8         Level 3 Parent, LLC 1992-12  whois.arin.net
## 6  005/8                    RIPE NCC 2010-11  whois.ripe.net
##                                                          RDAP Status..1.
## 1                                                               RESERVED
## 2                                     https://rdap.apnic.net/  ALLOCATED
## 3                                   https://rdap.db.ripe.net/  ALLOCATED
## 4 https://rdap.arin.net/registryhttp://rdap.arin.net/registry     LEGACY
## 5 https://rdap.arin.net/registryhttp://rdap.arin.net/registry     LEGACY
## 6                                   https://rdap.db.ripe.net/  ALLOCATED
##   Note
## 1  [2]
## 2     
## 3     
## 4     
## 5     
## 6

Cleaning up iana$Prefix since it uses the old/BSD-number formatting (allows leading zeros and we don’t need to know the CIDR component):

iana$Prefix <- sub(pattern = "^(00|0)", replacement="", x = iana$Prefix, perl = TRUE)
iana$Prefix <- sub(pattern = "/8$", replacement="", x = iana$Prefix, perl = TRUE)
tail(iana)
##     Prefix Designation    Date WHOIS RDAP Status..1.     Note
## 251    250  Future use 1981-09              RESERVED     [16]
## 252    251  Future use 1981-09              RESERVED     [16]
## 253    252  Future use 1981-09              RESERVED     [16]
## 254    253  Future use 1981-09              RESERVED     [16]
## 255    254  Future use 1981-09              RESERVED     [16]
## 256    255  Future use 1981-09              RESERVED [16][17]

Next, we want to extract just the prefix from the original AlienVault list of IP addresses. We do this using sub to perform replacement on all characters from the first “.” and onwards:

data$IPPrefix <- sub("\\..*","", data$IPAddr)
head(data$IPPrefix,10)
##  [1] "222" "222" "222" "5"   "178" "66"  "222" "222" "222" "174"

We’ll now create a Designation variable by assigning the appropriate value of iana$Designation based on the data$IPPrefix values:

data$Designation <- sapply(data$IPPrefix, function(ip){
  iana[iana$Prefix == ip, "Designation"]
})

summary(data$Designation, maxsum=8)
##                    APNIC                 RIPE NCC                     ARIN 
##                    93776                    74789                    42358 
##                   LACNIC     Administered by ARIN Administered by RIPE NCC 
##                    18914                    18008                     5893 
##    Administered by APNIC                  (Other) 
##                     2615                     2273

Quickly compare the distribution of registry present in our dataset against that of the external IANA data:

combined_Designation <- merge(
  data.frame(table(data$Designation)), 
  data.frame(table(iana$Designation)), by = "Var1"
)

colnames(combined_Designation) <- c("Registry", "AlienVault_Count", "IANA_Count")

head(combined_Designation[order(combined_Designation$IANA_Count, decreasing = TRUE), ],8)
##                 Registry AlienVault_Count IANA_Count
## 3   Administered by ARIN            18008         55
## 8                  APNIC            93776         45
## 10                  ARIN            42358         36
## 30              RIPE NCC            74789         35
## 21            Future use                0         16
## 27             Multicast                0         16
## 25                LACNIC            18914          9
## 2  Administered by APNIC             2615          6

We can compute correlation between the number of malicious nodes attributed to a registry (AlienVault_Count) and the number of assigned network blocks from our IANA data. It makes sense that there would be more malicious nodes in larger groups of assigned network blocks.

## [1] 0.6898625

By default, the cor() computes the correlation using the “pearson” method, but we could as well have specified the method to be “spearman” or “kendall”.

2.1.1 Correlation Methods

The different values for our method parameter corresponds to the different correlation methods:
- pearson: \(\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2 \sum{(y-\bar{y})^2}}}\) or alternatively: cov(x,y)/(sd(x)*sd(y)) - spearman: \(R_s = 1-(\frac{6\sum{d}^2}{n^3-n})\) where \(d^2\) is the squared difference between the ranks of two values and n is the number of observations
- kendall: not discussed in details here, but refers to Kendall’s Tau (Kendall’s Rank Correlation)

3 Graph-based EDA

For this chapter, we’ll work with the ZeuS botnet malware. It is a crimeware kit that steals credentials from online services like social networks, online banking accounts, FTP accounts, email accounts and other (phishing). Our data is gathered from the ZeuS tracker site, and carries the following description:

This blocklist contains the same data as the ZeuS IP blocklist (BadIPs) but with the slight difference that it doesn’t exclude hijacked websites (level 2) and free web hosting providers (level 3). This means that this blocklist contains all IPv4 addresses associated with ZeuS C&Cswhich are currently being tracked by ZeuS Tracker. Hence this blocklist will likely cause some false positives.

##               IP
## 1 101.200.81.187
## 2  103.19.89.118
## 3 103.230.84.239
## 4   103.4.52.150
## 5   103.7.59.135
## 6 104.247.219.41

3.1 Network intelligence-based Preprocessing

In the following section, we will be using some preprocessing functions from the netintel library and also some custom functions:

To see a few of the functions at work, we’ll try with BulkOrigin which retrieves the BGP Origin ASN for a list of IPv4 addresses:

BulkOrigin("182.253.162.193")
##      AS              IP       BGP.Prefix CC Registry  Allocated
## 1 17451 182.253.162.193 182.253.162.0/24 ID    apnic 2010-05-10
##                            AS.Name
## 1 BIZNET-AS-AP BIZNET NETWORKS, ID

Another example is the BulkOriginASN which retrieves BGP Origin ASN info for a list of ASN ids:

BulkOriginASN("AS17451")
##      AS CC Registry  Allocated                          AS.Name
## 1 17451 ID    apnic 2000-11-07 BIZNET-AS-AP BIZNET NETWORKS, ID

We can also retrieve BGP Peer ASN info for a list of IP addresses:

BulkPeer("182.253.162.193")
##   Peer.AS              IP       BGP.Prefix CC Registry  Allocated
## 1    6453 182.253.162.193 182.253.162.0/24 ID    apnic 2010-05-10
## 2    6939 182.253.162.193 182.253.162.0/24 ID    apnic 2010-05-10
## 3   58463 182.253.162.193 182.253.162.0/24 ID    apnic 2010-05-10
## 4  196844 182.253.162.193 182.253.162.0/24 ID    apnic 2010-05-10
##                                                                       Peer.AS.Name
## 1                                   AS6453 - TATA COMMUNICATIONS (AMERICA) INC, US
## 2                                           HURRICANE - Hurricane Electric LLC, US
## 3                               PGASCOM-AS-AP PT.PGAS TELEKOMUNIKASI NUSANTARA, ID
## 4 PIONIER-AS-AMS-IX PIONIER, National Research and Education Network in Poland, PL

Applying that function on zeus$IP, we obtain a data frame in return:

origin <- BulkOrigin(zeus$IP)
head(origin)
##       AS             IP       BGP.Prefix CC Registry  Allocated
## 1  37963 101.200.81.187   101.200.0.0/16 CN    apnic 2011-01-24
## 2 132717  103.19.89.118   103.19.89.0/24 IN    apnic 2013-03-28
## 3 132717 103.230.84.239  103.230.84.0/24 IN    apnic 2014-04-29
## 4  46062   103.4.52.150    103.4.52.0/24 ID    apnic 2011-05-19
## 5 131447   103.7.59.135    103.7.56.0/22 TH    apnic 2012-03-15
## 6  46261 104.247.219.41 104.247.216.0/21 US     arin 2014-12-09
##                                                             AS.Name
## 1 CNNIC-ALIBABA-CN-NET-AP Hangzhou Alibaba Advertising Co.,Ltd., CN
## 2    NDCTPL-IN NxtGen Datacenter & Cloud Technologies Pvt. Ltd., IN
## 3    NDCTPL-IN NxtGen Datacenter & Cloud Technologies Pvt. Ltd., IN
## 4                      LINTASBUANA-AS-ID PT. BUANA LINTAS MEDIA, ID
## 5                        POP-IDC-TH POPIDC powered by CSLoxinfo, TH
## 6                                QUICKPACKET - QuickPacket, LLC, US

3.1.1 Working with igraph

suppressPackageStartupMessages(library(igraph))
bookclub <- graph(edges=c("Samuel", "Mulia", "Mulia", "Jessica", "Jessica", "Samuel", 
                          "Andita", "Jessica", "Jessica", "Tiara", "Samuel", "Tiara"), 
                    directed = TRUE)
plot(bookclub)
title(main = "Who borrowed from who?", sub = "Source: Algoritma Book Club")

In “graph language”, vertex is a synonym for a node: the common point of two line segment. Edges (lines) are what connect two nodes in a network. In the following exercise, we’ll create a simple graph using the igraph package.

The graph (visualized):

For each IP address (origin$IP) retrieve the origin AS CC and return them as a pair to create the IP-> CC edge list

## [1] "101.200.81.187" "CN"             "103.19.89.118"  "IN"            
## [5] "103.230.84.239" "IN"

Remember how we use + vertices to add vertices? Similarly, we can also use + edges to add edges:

g <-  g + edges(unlist(ip_cc_edges))
ecount(g)
## [1] 113
vcount(g)
## [1] 226

Plotting the graph again:

plot(g, vertex.label.cex=0.4)

As an additional exercise: if you call tkplot(g) and then tinker with the various options in “Layout”, you’ll see that igraph has included an interactive graph drawing facility.

We’ll now try to simplify the graph using the simplify and delete function (both from igraph). simplify maps multiple edges to single edges. This function also provides a mechanism to specify what to do with the vertices / edge attributes in these cases: - vertex.attr.comb
- edge.attr.comb

The specification of the combination of (vertex or edge) attributes can be given as either a character scalar, a function object or a list of character scalars and/or function objects.

## [1] 226
## [1] 152
## IGRAPH 7e1b6b8 DN-- 152 113 -- 
## + attr: name (v/c), color (v/c), size (v/n), group (v/n)
## + edges from 7e1b6b8 (vertex names):
##  [1] 101.200.81.187 ->CN 103.19.89.118  ->IN 103.230.84.239 ->IN
##  [4] 103.4.52.150   ->ID 103.7.59.135   ->TH 104.247.219.41 ->US
##  [7] 109.127.8.242  ->AZ 109.229.210.250->LV 109.229.36.65  ->RU
## [10] 113.29.230.24  ->SG 120.31.134.133 ->CN 120.63.157.195 ->IN
## [13] 124.110.195.160->JP 128.210.157.251->US 139.59.36.231  ->SG
## [16] 141.8.226.58   ->CH 151.97.190.239 ->IT 162.223.94.56  ->US
## [19] 176.107.179.60 ->UA 176.107.191.119->UA 177.4.23.159   ->BR
## [22] 180.182.234.200->KR 185.35.138.22  ->NL 185.6.242.251  ->IT
## + ... omitted several edges

Notice that the number of vertices has reduced from 226 to 152. We simplified the graph by comibining common edges and delete any standalone vertices (ASNs); In graph terms, this means any vertex with a degree of 0.

The plot looks like this:

By setting arrow.size to be 0 and remove the IP addresses (name) such that it only show the ASNs, we can have a cleaner graph.

E(g)$arrow.size <- 0
V(g)[grep("\\.", V(g)$name)]$name <- ""
plot(g, vertex.label.cex=0.4)

We can use the layout_on_* or layout_as_* layout options to achieve the desired graph visualization:

# L <- layout_on_sphere(g)
# L <- layout_on_grid(g)
L <- layout_as_star(g)
plot(g, layout=L, 
     vertex.label.dist=2,
     vertex.label.cex=0.6)

We can also use layout_nicely, which choose an appropriate graph layout algorithm automatically:

set.seed(0)
L <- layout_nicely(g)
plot(g, margin=0, layout=L,
     vertex.label.dist=0.7, 
     vertex.label.cex=0.7,
     vertex.label.color="black",
     vertex.label.family="sans", 
     vertex.label.font=2,
     main="ZeuS botnet nodes clustered by Country")

4 Spatial Data and Maps

In the following chapter, we’ll be working with a dataset named zeroaccess, prepared by Bob Rudis and Jay Jacobs courtesy of Symantec. This is a list of ~810,000 latitude-longitude pairs corresponding to the client IP addresses infected with the ZeroAccess toolkit, collected over a 24-hour period during the month of July 2013.

4.1 geom_path() and coord_map()

Symantec has a synopsis on the zeroaccess trojan horse: > Trojan.Zeroaccess is a Trojan horse that uses an advanced rootkit to hide itself. It can also create a hidden file system, downloads more malware, and opens a back door on the compromised computer.

The primary motivation of this threat is to make money through pay per click advertising and does so by downloading an application that conducts web searches and clicks on the results (click fraud). It is also known to download software onto compromised computers in order to mine bitcoins for the malware creators. Bitcoin mining with a single computer is a futile activity, but when it is performed by leveraging the combined processing power of a massive botnet, the sums that can be generated is considerable. Furthermore, it opens a back door and connects to a command and control (C&C) server, which allows the remote attacker access to the compromised computer and, in turn “transforming” it into part of a wider botnet.

Let’s start by reading in the data and plotting it with the familiar ggplot2:

First thing to notice is that from the ~810k data points that we plotted, the overall distribution resembles a world map. High density areas include the Eastern half and west coast of the US, Europe, some parts in South America (around Brazil), India and Japan. China (and to a certain extent Russia), on the other hand has no density, leading some cybersecurity analysts to speculate if the Chinese government itself is responsible for the global digital pandemic. One such analyst wrote:

This sophisticated malware actually has the ability to “learn” and evolve, to become one of the most infectious computer viruses to hit the globe. It also seems that no country is safe from the virus. No country, that is, except for China. The United States, Canada, and Great Britain were hit the hardest, with the rest of the European countries trailing closely behind. However, for reasons unknown, it appears that China is emerging almost completely unscathed from the viral attack. China’s overall lack of infection has not gone unnoticed, and has left many wondering, was China itself responsible for the Zeroaccess Botnet pandemic?

It is important however, to not be drawn into making quick conclusions such as the above. At this point, the best we can have are theories, but the quick visualization exercise serve as a good guiding point - we can and ought to use EDA tools such as this to further probe the data. Later in this section we’ll plot another visualization, but instead using ggplot’s map_data() argument to load ~100k rows of map data into a data frame so we can layer it underneath the scatterplot.

We’ll also create a map projection using one of the map projection provided through the mapproject::mapproject function and made available directly from ggplot2. Here are some examples of the map projections that we could use:

suppressPackageStartupMessages(library(ggpubr))

plot1 <- ggplot(world, aes(x=long, y=lat)) +
  geom_path(aes(group=group)) + 
  coord_map("rectangular",lng=200)

plot2 <- ggplot(world, aes(x=long, y=lat)) +
  geom_path(aes(group=group)) + 
  coord_map("mollweide", xlim=c(-200,200))

plot3 <- ggplot(world, aes(x=long, y=lat)) +
  geom_path(aes(group=group)) + 
  coord_map("gilbert", xlim=c(-200, 200))

plot4 <- ggplot(world, aes(x=long, y=lat)) +
  geom_path(aes(group=group)) + 
  coord_map("sinusoidal", xlim=c(-200,200))

ggarrange(plot1, plot2, plot3, plot4, ncol=2, nrow=2, common.legend=TRUE, widths=c(1,2))

Layering geom_path, coord_map("mercator") and geom_point, we can see the countries most infected / affected by ZeroAccess:

world$region <- as.factor(world$region)
summary(world$region, maxsum=11)
##    Canada    Russia       USA Indonesia     China Greenland     Chile 
##     11573      7354      5753      3715      2710      2240      2006 
##    Norway    Brazil Australia   (Other) 
##      1985      1885      1804     53655

As an extra tip, notice that a good alternative (to geom_path) with map visualization is geom_polygon. In the following code we see one example of its use:

top10 <-subset(world, world$region %in% names(summary(world$region, maxsum=10))[1:10])

ggplot(top10, aes(x=long, y=lat, group=group)) +
  geom_polygon() +
  geom_path(color="white")+
  theme_minimal()

4.2 From IP to choropleth

Our original zeroaccess dataframe has only latitude and longitude values, but conveniently enough, we can use the map.where function to retrieve country from the values of these two columns:

library(maps)
zeroaccess$country <- map.where(database = "world", zeroaccess$long, zeroaccess$lat)
head(zeroaccess)
##        lat     long country
## 1 -10.0000 -55.0000  Brazil
## 2  38.0888 -78.5592     USA
## 3  38.9990 -84.6266     USA
## 4  48.6210   7.4944  France
## 5  43.2342 -86.2484     USA
## 6  47.0000  20.0000 Hungary

By merging that data frame with the original world data frame (6 variables), we now created 7 (+1) features. The following is a sample of the new data frame with our newly created / merged variable freq:

##                         region        long        lat group order
## 72933                  Uruguay  -58.162209 -32.566502  1442 91560
## 21595 Central African Republic   19.500977   5.127491   244 14555
## 30266                 Ethiopia   38.967773   3.520606   516 37677
## 46592                   Mexico -114.157326  27.867970   985 61574
## 73866                      USA -160.080032  21.907421  1448 91781
## 16403                   Canada  -72.903954  67.944778   350 22862
## 73066                      USA  -97.202248  26.299805  1459 91884
## 76829                      USA -140.762741  60.259132  1569 95958
## 53741                     Oman   54.527050  19.845800  1114 70351
## 51164                  Nigeria    6.589942  13.409131  1051 66698
##           subregion   freq
## 72933          <NA>    498
## 21595          <NA>      3
## 30266          <NA>    100
## 46592          <NA>   9318
## 73866        Hawaii 267352
## 16403 Baffin Island  28080
## 73066         Texas 267352
## 76829        Alaska 267352
## 53741          <NA>    571
## 51164          <NA>    499

For our choropleth to render correctly, we will perform a sorting to sort by first the group and then the order variable:

Numerically, USA account for 71% of all ZeroAccess infections within our dataset, with Canada a distant second at 15%:

##        region       freq prop
## 175       USA 1538076056 0.71
## 33     Canada  324969840 0.15
## 136    Russia   63965092 0.03
## 26     Brazil   42438890 0.02
## 11  Australia   34236312 0.02
## 72      India   28951296 0.01
## 78      Italy   16223394 0.01
## 58    Germany   15575128 0.01
## 54     France   13206545 0.01
## 105    Mexico    9457770 0.00

5 Bonus Content

5.1 dplyr system

## Warning: 'plyr' namespace cannot be unloaded:
##   namespace 'plyr' is imported by 'ggplot2', 'netintel' so cannot be unloaded

5.2 Level Plots in base R, lattice and ggplot2

From the elevation dataset, we multiply the x and y by 50 (help documentation say the coordinates are in multiples of 50 feet) and z by 10 (help documentation say the elevations are in multiples of 10). Rather than plotting the individual values what we would do instead is to create a surface that covers the whole grid region using the loess function. This function fits a local polynomial trend surface using weighted least squares to approximate the elevation across the whole region. The function for a local quadratic surface is:

We will then create a grid region that goes from 10 to 300 feet in both the x and y directions:

Use our loess model to predict and fill the grid. Notice that z is a matrix with two dimensions (x and y respectively):

Now use image() to see how our loess model fill the grid with its fitted value. Adjust the span to be 0.7 or higher and then re-run the exercise to see how our fitted values are smoother:

As a comparison, take a look at the fitted values when we use a linear model instead of our loess model:

Instead of creating a colored grid using image(), we can use levelplot from lattice:

Data on soya bean production in a uniformity trial measured at plots of size 5x5 meters and other soil properties measured in points given by the data coordinates. X and Y variable refers to the respective coordinates, PH is the soil PH value, K is the potassium content and PROD is production converted to productivity by a unit of area - hectare (ton/ha):

##      X   Y   P  PH    K    MO    SB iCone   Rend PROD
## 1  5.6 3.6 4.4 6.3 0.53 47.52 80.63  31.0 6.3980 2.56
## 2 15.2 1.6 3.7 6.3 0.38 48.74 82.23  29.5 7.0502 2.82
## 3 24.8 4.6 3.5 5.4 0.47 51.79 61.47  23.0 5.8848 2.35
## 4 34.4 5.6 3.9 5.2 0.48 57.88 59.09  24.5 8.5182 3.41
## 5 44.0 1.6 3.6 5.6 0.52 59.70 71.61  18.5 7.4749 2.99
## 6 53.6 1.6 3.7 5.2 0.45 53.61 56.98  21.5 9.3160 3.73

5.2.1 Importance of visuals in EDA

Anscombe (1973) has a nice example where he uses a constructed dataset to emphasize the importance of using graphs in statistical analysis. The data is built into R:

##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04
## Warning: 'plyr' namespace cannot be unloaded:
##   namespace 'plyr' is imported by 'ggplot2', 'netintel' so cannot be unloaded
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## The following object is masked from 'package:igraph':
## 
##     crossing
## # A tibble: 4 x 6
##   set   `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
##   <chr>     <dbl>   <dbl>     <dbl>   <dbl>       <dbl>
## 1 1             9    3.32      7.50    2.03       0.816
## 2 2             9    3.32      7.50    2.03       0.816
## 3 3             9    3.32      7.5     2.03       0.816
## 4 4             9    3.32      7.50    2.03       0.817

Notice that all the four sets have the same statistical properties and even the same regression coefficients; A visual exploration however surfaced 4 very different pattern within each dataset.

5.2.2 Exercise:

  1. Compute a loess model ()
  2. Grid space (48 x 54): On X: from 2 to 146 increment by 3 On Y: from 2 to 110 increment by 2;
  3. Predict on grid with loess model
  4. Visualize

##  [1] "greenyellow"          "lightgoldenrodyellow" "lightyellow"         
##  [4] "lightyellow1"         "lightyellow2"         "lightyellow3"        
##  [7] "lightyellow4"         "yellow"               "yellow1"             
## [10] "yellow2"              "yellow3"              "yellow4"             
## [13] "yellowgreen"

Primer on Left Shift and Right shift

strtoi(00000111, base = 2)
## [1] 7
strtoi(00001110, base = 2)
## [1] 14
bitShiftL(7,1)
## [1] 14
strtoi(01000011, base = 2)
## [1] 67
# shift the bits to left by 1
strtoi(10000110, base = 2)
## [1] 134
bitShiftL(67, 1)
## [1] 134
# # shift the bits to left by 2
strtoi(100001100, base = 2)
## [1] 268
bitShiftL(67, 2)
## [1] 268
toDotNotation(268)
## [1] 0 0 0 0 1 1 0 0
toDotNotation(12)
## [1] 0 0 0 0 1 1 0 0
toDotNotation(255)
## [1] 1 1 1 1 1 1 1 1

BGP in Indonesia: https://dnslytics.com/bgp/id